Okay. Does anyone want to see uh Steve's feedback from the specification? Right. Not really, um just what he's talking about, like duplication of effort and Like duplication of effort and stuff, and um yeah, he was saying that we should maybe uh think about having a prototype for week six, which is next week. Yeah. So we should probably prioritize our packages. Mm. Yeah. Yeah. Hmm. Has has anyone actually looked at the Java code for the, huh? Hmm. Yeah, I think so. Yeah, I I don't know about the search functionality, that might be online. Depends how it's gonna work. Yeah. Mm-hmm. Yeah, that makes sense. Hmm. Hmm. Yeah, you just concatenate them together. Hmm. Yeah. It just means it loads on demand. It only loads when it needs a particular type of file. Like when it's being accessed. Yeah, I think that's the idea, it just loads the particular ones it needs. But if you were doing a search over the whole corpus you'd have to load them all. Hmm. Mm. Hmm. Yeah, we do not want it in to develop a little tree display as well for multiple results. Yeah, but that'd be quite easy to do. You just need to find the time stamp. Yeah. Yeah. Yeah, I think I think those segments for each utterance are split up. Think so. Yeah, I'm pretty sure it's already there. Pretty sure that's already there. The the utterances are numbered. Hmm. Yeah, I think so. Ye that's the impression I get, yeah. Oh. Hmm. Ye Mm. Yeah, uh Right. Okay. Topics, yeah. Yeah, I think that's the right one. Hmm. Hmm. Mm-hmm. Mm. Hmm. Yeah, that'd be much more efficient to do that. Yeah. Hmm. Hmm. Yeah, you're able to do that in Java, yeah? Yeah. Huh. Hmm. Yeah, I've had a b I've had a look at the the topic segments, how it's stored. And then yeah, th those are few per meeting, and it um well, it gives a time stamp and inside each one there's uh the actual like utterance segments. And the list of them that occurred. And they're all numbered. Um so that's where that's stored. 
Yeah, so I guess um if I'm gonna be segmenting it with a L_C_ seg then that's like same format I'd want to um put it back out in so it'd be equivalent. Well, like the integration. What do you mean, integration? Hmm. I don't know. I don't think anyone's been allocated to do that yet. Yeah, yeah. Yeah, definitely. Hmm, yeah. Yeah, it c could be difficult, yeah. Yeah. Well I guess the important thing is to get the crucial m modules built. Ye yeah. Yeah, and then Yeah, and then we'll maybe have to prioritize somebody into just integrating it. Mm-hmm. Yeah, I think so. Uh yeah. Hmm. Yeah, yeah. Jasmine, I thought you just said that you'd uh looked at extracting the text. Yeah. So you you said you did it in Python, yeah? Yeah, did you use uh b the X_L_ uh X_M_L_ parser in Python? Right. Yeah, sounds pretty good. So um 'cause, yeah, I was having a look in it a look at it as well and I noticed the um the speakers are all in that separate file? So did did you have to combine them all and and then re-order them? Yeah. Ye yeah, c Right. Yeah, so that's approach um well, I was going to do. So yeah, we may as well collaborate. In the word files? I'm not sure I what you mean. Oh right. Hmm. Hmm. Mm I thought they were local to th a particular meeting. Hmm. Mm is there anything else we should discuss? Yeah, should we not have like a group directory or something where we can put all our code in and that kinda thing? Hmm. I've gotten mm hardly any Hmm. Yeah, we can ask Steve if um we can get space. Yeah, uh we could do that. Yeah, I'm sure he had to deal with that last year. Yeah. Hmm. That sounds good. Hmm. Hmm. Yeah, that's what I'm gonna need. Yes. Yeah, it's just mo changing it a bit. Yeah. No, but uh that's what M_L_C_ seg does. It it marks the end of each segment. Yeah. Yeah. Oh. Yeah. Yeah, for me it's better if they're by meeting. Then that'll be really easy to do once they've got the raw text. It's just a case of running the script. Yeah, I mean hopefully this week. 
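Note on the extraction step discussed above (words stored per speaker, then combined and re-ordered by time): a minimal Python sketch, assuming each speaker's words arrive as an XML string. The element name `w` and the attribute `starttime` are assumptions made for illustration and may not match the real corpus files; `words_in_time_order` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def words_in_time_order(xml_by_speaker):
    """Merge per-speaker word lists into one list ordered by start time.

    xml_by_speaker maps a speaker label to an XML string.
    The tag name "w" and attribute "starttime" are assumed, not confirmed.
    """
    rows = []
    for speaker, xml_text in xml_by_speaker.items():
        root = ET.fromstring(xml_text)
        for word in root.iter("w"):
            start = word.get("starttime")
            if start is None:
                # Some elements carry no time stamp; skip them here.
                continue
            rows.append((float(start), speaker, (word.text or "").strip()))
    rows.sort(key=lambda row: row[0])
    return rows
```

Elements without a time stamp are simply dropped in this sketch; whether that is right for the real data would need checking.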
Alright. And we could... Don't know. Suppose we're just getting on with all our components. I know. Yeah. Yeah, he suggested that we could have an initial prototype. I know, I'd be surprised if we can get anything working by next week. Alright.
Yeah. Yeah, I mean if we just want to have um some data for the user face, could even be random data. Uh mm mm Yeah, I'm Hmm. Yes. Hmm yes. Hmm. I'm not so sure. I I thought we would just have like um one big summary um with all the uh different importance levels um displayed on it. And depending on what our um zoom level is, we just display a part of it. And we would have one very big thing off-line. And from that we would just select what we are displaying. Yes. So for example you would um give a high value to those um sequences you want to display in the meeting series summary. And you just cut off. That was what I sh I thought, yeah. I thought. But I think the m difference might be that we want just want to have um the words. And that's not so much what he meant with not possibly loading everything was that you m um load all the uh annotation stuff, all the sound files, all In Um I r I I'm getting quite lost um at the moment because um w what's um our difference between the um se um uh the importance measure and the skimming? I mean, do we do both or is it the same thing? Okay. So but when when we talk about summaries you talk about this uh abo about skimming and not about Yeah. Yeah right, isn't that the skimming? Isn't that the skimming? Yeah, but it use the same data. Yeah. A And, yeah, I think we also thought about combining that measure with um the measures I get from um s uh hot-spots and so on. So that would also be on utterance level, I think. I think. Yes, sure. Yes. Yes, right. Oops, it does. So I define baseline and what it loads? For example it loads all the utterances and so on, but it doesn't load um the discourse acts and for example not the and what's what else there? Not the summaries. It only loads those on demand. Y you mean that you um basically split up th the big thing into um different summaries. For example that you have a very um top-level um summary and a separate file for for each level. Mm-hmm. Yes. 
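Note on the "one big summary, cut off by zoom level" idea above: a minimal Python sketch, assuming every utterance already has an importance value computed off-line. The function name `visible_utterances` and the `keep_fraction` parameter are hypothetical.

```python
def visible_utterances(utterances, keep_fraction):
    """Return the utterances shown at a given zoom level, in original order.

    utterances: list of (text, importance) pairs in time order.
    keep_fraction: share of utterances to display, between 0 and 1.
    """
    if not utterances:
        return []
    keep = max(1, round(len(utterances) * keep_fraction))
    # The importance of the keep-th best utterance becomes the cut-off.
    cutoff = sorted((imp for _, imp in utterances), reverse=True)[keep - 1]
    return [text for text, imp in utterances if imp >= cutoff]
```

With tied importance values this can show slightly more than the requested share, which is probably acceptable for a display cut-off.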
N Uh no no, it's f for No, you're right. Yeah. It's for Um No, I I think we would just take the segments that are already that were Yeah, there's um this segments file. Um you know, the X_M_L_ segments. Oh. That I don't know. Yeah, that's um Mm-hmm. There there are time stamps um for, well, segments um and for th um segments is for example when when you look at the data, what is displayed in one line. What when when you look at it in the hmm? I think so. Isn't Um for ex um I I compared it with what I did for the pause um duration extraction. Um and basically it's uh words that are uttered in a sequence without pauses. But sometimes um however there are um short pauses in it and they're indicated by square brackets pause or something in the data. Um someti uh but uh the annotators decided what was one segment and what wasn't. I think so. Yeah, but um I think for some annotations um an uttera ca utterance can have several um types. For example for the dialogue acts and so on. Okay. Yeah, that should be for Yeah. Should be, yeah. Yes, but that's Yeah, everything that's a word has a sti time stamp. That's at the end. That's at the end, I think, her time. Yeah, maybe. Didn't have a look at our meetings. Uh I I think it wouldn't as it occurs I mean it would be it occurs in every meeting. So And I think it even has uh its own annotation, like digits or something. So that should be really easy to cut out. Yeah. I'm sure. Ah it's just to test the system, I think. So Mm they have to read numbers from Uh I didn't have a look at that. So They Mm-hmm. Uh th yeah. 'Kay. Um I just um wondered, so who's uh then doing um the frequencies on on the words, because I'm I think I will also um I could also make use of it um for the agreement and disagreement thing. Because I um I in my outline I talked about um using the um discourse acts first, and um then in the chunks of text I found looking for word patterns and so on. 
So um I would for example need the um most freq um frequent words. So if you cut off all that, I'd won't be use or Yeah, I I but I need it for my chunks then. I would You know? Yeah, but I'd uh I would like to look at the frequency of words in my um in the regions of text I found out to be interesting. So I wouldn't need it. It it would have to be re-calculated only for my segments. Huh? Uh uh mm. I think it would be, you know, l as as big at as the hot-spot annotation things. That's quite small, yeah, that's some utterances. Yes. Yeah, yeah. So I would probably just concatenate all my um text chunks and then let's say m I will run over it. Yes. Yes, definitely. Yeah, right. Ye M Um Jasmine, uh um what is um the text you're extracting uh looking like then at the end? Because um I I think it's actually very similar to what I did for my um speaker um uh extraction and I think I would ch perhaps have to change two lines of codes to get you um for each meeting a file that says fr from um this millisecond to this millisecond there was this sequence of words. And so on. So that's just changing two lines of code. And it would give you that. So Um yeah. So far I extracted um the dura durations. But it's from the words file. So I could just um contatenate concatenate um the words instead of the durations, and it should I mean Should be very straight-forward. I can try to do it and send it to you. Pe and you have a look at it, will it make sense for what you want. Yeah, uh p I mean it I just let it run over all the files. So Yes. I just ordered. Uh I ordered according to the um starting times of the utterances. What do you mean by diffe Yeah, I mean t I I have one what I give you would be one file for each meeting. Yeah, not for each meeting series. I didn't do that yet. Yeah, one group, yeah. Yeah, I mean there's one series that has just one meeting. Yes. Um the you you the data is of the form you have um three identification letter. 
So B_E_D_ or B_B_D_ or something, and that's always the same group. And then after that there's um a number like O_O_ one, O_O_ two. So, it's a Yeah, but that's that's really quite easy to see because they're named. Yes. But I I mean as um the start uh start times um start for each meeting at zero, you could just probably just um add the um final second time to the next meeting and so on and just put it all together. But then we would have to change um the information about who on which channel it was set, um to by which person it was set. And that is actually stored in another X_M_L_ document. Yeah, I w would then just not print out the um start and end times. No, it's for every single word. Or for every single utterance. Yeah, that depends on what you want. Yeah, but I do it with Perl, it's just string manipulation. So I would I mean I would just Sure. No, I didn't do a sea no. And you would want that all in one file for all the corpus? Or For the series. Yeah, I can directly put it into uh just like So uh only words um per meeting series. Uh-huh. Yes. Yeah, they will just I will just take I would uh take over the names they have anyway. Yeah, yeah. Yeah, one series has the um same three starting letters. So So only words and words and times. And you Yeah, you want it ordered. Okay. Okay, anybody Um ord base dot times. Yeah, and do you want Yeah, sometimes they're contained in one another. So Just after th mm-hmm. 'Kay. Ordered. Only words. Um and I think um for all the corpus, it's just from I know from other times, it's um nine megami byte to have I mean should be should be similar to have the words. Should be. Na um all the words together um for all the meetings. That's what I'm guessing that's, you know, um what I because nine mega-byte is what I got for when I said for every um utterance, this is goes from there to there and takes takes seconds. Oh. Yeah, I mean I'm it doing it for all of it. Doesn't matter. 
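Note on building one file per meeting series, as described above (meetings in a series share their first three letters, and each meeting's times start at zero): a minimal Python sketch that shifts each meeting by the end time of the previous one. The meeting names used in the usage are hypothetical, and `concatenate_series` is an illustrative helper, not existing project code.

```python
from collections import defaultdict


def concatenate_series(meetings):
    """Join meetings into series, shifting times so they run on continuously.

    meetings: dict mapping a meeting name to a list of (start, end, text),
    with times counted from zero inside each meeting.
    Returns a dict mapping the three-letter series prefix to one merged list.
    """
    series = defaultdict(list)
    offsets = defaultdict(float)
    for name in sorted(meetings):
        prefix = name[:3]
        offset = offsets[prefix]
        last_end = 0.0
        for start, end, text in sorted(meetings[name]):
            series[prefix].append((start + offset, end + offset, text))
            last_end = max(last_end, end)
        # The next meeting in this series starts where this one ended.
        offsets[prefix] = offset + last_end
    return dict(series)
```

Sorting by name assumes the numbering in the names reflects meeting order. The speaker-to-channel mapping mentioned above is not handled here.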
Yeah, I mean I hope it will be the same for the words. It's just what I... Mm-hmm. Mm. So I will probably send just one file of the first meeting to all those who need it, so that you can have a look at whether that's what you want. Yeah, I mean if it's just for one meeting, it's really not too big. Yeah. What do we have to demonstrate?
The basic word importance is off-line as well. The combined measure might not be if we want to wait what the user has typed in into the search. Yeah. I'm not quite so what it did you want to do it, i you just wanted to assign Uh I thought about words. Mm. Mm okay. Yeah, but how about those words which don't carry any meaning at all, the um and uhs and something like that. Because if we if we average average over over a whole utterance all the words, and there are quite unimportant words in there, but quite important words as well, I think we should just disregard the the Okay. Alright. Yeah. But there is no I_D_ for an utterance I think. It's just for individual words. So how do we do that then? We for utterances as well. I think it's just for one word. So we have to Yeah. Uh I'm not quite sure, I have only seen that the uh the individual words have got an I_D_. Yeah. You always could have a look at the time stamps and then take the ones that uh belong together to form an utterance. Yeah, if they are already, there's it's easy but it would be possible. Uh yeah. Okay. You s uh you said you are currently in uh implementing the idea. What exactly are you computing? Okay. Okay. Mm-hmm. Mm-hmm. Yeah, I w I w I would need the raw text pretty soon because I have to find out um how I have to put the segments into bins. And then yeah. No, that's not necessary. Yes, I did. But um I've only just got the notes. I have to still have uh to order everything by the time and Yeah, I think it's quite easy after the Yeah. Yeah. So uh Mm-hmm. Yeah, b I uh w that's what I was uh thought. That you just combine them and then order the time stamps accordingly. Okay. Um what I found out was that there are quite a lot of things without without s time stamps in the beginning. Yeah, and uh X_M_L_ files. Yeah, that's just an I_D_ or something. I don't know. Just numbers. Yes, but what are the other things that's uh some kind of number? 
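Note on averaging word importance over an utterance while disregarding words that carry no meaning, as raised above: a minimal Python sketch. The filler list and the `importance` dictionary are assumptions for illustration; the real importance values would come from whatever off-line measure is chosen.

```python
# A tiny, assumed list of fillers; a real stop list would be longer.
FILLERS = {"um", "uh", "mm", "hmm", "yeah"}


def utterance_score(words, importance):
    """Mean importance of the non-filler words in one utterance.

    words: list of word strings.
    importance: dict mapping a lower-cased word to its importance value.
    """
    content = [w.lower() for w in words if w.lower() not in FILLERS]
    if not content:
        return 0.0
    return sum(importance.get(w, 0.0) for w in content) / len(content)
```

An utterance made up only of fillers gets a score of zero here, so it would drop out of any summary first.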
F maybe the file number or something that is in the beginning. What is that? Do you know? Um I think there are quite a lot of numbers in the beginning where n there is no time stamp for the numbers. It's Think they say um quite a lot of numbers and before that, uh um there's this number. Was it Yeah, there i are numbers in the um the W_ tag, but there are no time stamps. Yeah. Yeah, in the beginning as well sometimes, I think. At least I saw some. Yeah. Yeah. But what it is it actually that numbers? Okay, so but there are no time stamps annotated to that. It's it's quite strange. And also um there are different um combinations of letters. B_R_E_ and something like that. Is it everything ordered are the time stamps global or uh are they local at any point? Okay. Yeah, it's Rainbow. It's um I think it's just the dictionary in the first place. But Um no, I have to bin it up and so I will only have counts for each each bin or something. It's because um Rainbow is a text classification system. And I think it's not possible to have just one class. That's the problem. Maybe we could Yeah sure, you sure, we could do that, but I don't that makes sense. If we need just frequencies, maybe we should just calculate them by using Perl or something. I don't know. Yeah, it's quite easy to just count and s or sort them by um frequency. Just using a Perl script. Is it too big? Yeah. Hmm. I don't know how you how many terms you can handle in Perl. Mm yeah. Uh I can get all the raw text, but it has to be ordered still. So No, it isn't. Um it's in what is implemented in Rainbow is information gain, and I'm not quite sure how they calculate that. Yeah. Uh that's what Rainbow does. I think you j can just get probabilities for a certain words for each document. Certain Um we would have to look at that. Mm-hmm. Oh. Yeah, that's what I thought as well, that you that probably the the topic segment level is the most um informative for the words. Yeah, that's the problem. I don't know. Mm-hmm. 
So shall we sit together tomorrow then as well? Uh Okay. Um, yeah, w would it be best? At the moment it's it's just lines of Mm-hmm. Um Okay. So um you'd do you extract the words, the raw text, as well? Uh Okay. Mm-hmm. Print out. Okay. Okay, that Okay. So have we already extracted from all the files? Yeah. Did you also order Mm-hmm. Hmm. Hmm. Okay. Uh I don't need the times, I just need the words. But um Yeah, in the right order. Yes. Yeah, that doesn't matter too much, I think. Hmm. Mm-hmm. How long would it take to make the frequency counts with a Java hash table? Yeah. No, how long you would have to program something. Okay. Mm. Because it's quite easy in Perl as well, it's just a line of code for counting all the words and yeah, it's it's by hashes. Yeah. Yeah. 'Kay.
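Note on counting word frequencies with a hash table, as discussed above for Perl and Java: the same idea in Python is a few lines with `collections.Counter`. The tokenisation rule (lower-case letters and apostrophes) is an assumption made for this sketch.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word occurs in a piece of raw text."""
    # Assumed tokenisation: runs of letters and apostrophes, lower-cased.
    return Counter(re.findall(r"[a-z']+", text.lower()))
```

`word_frequencies(raw_text).most_common(20)` would then list the twenty most frequent words, which only needs the words, not the times.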
I I dry-read it the last time.. Next week. Yeah. Yeah. No. Uh mine's gonna be mostly using the off-line. But the actual stuff it's doing will be on-line. But it won't be very um processor intensive or memory intensive, I don't think. Don't think so. Yeah. Are we still gonna go for dumping it into a database? Are we still gonna dump it into a database? 'Cause if we are, I reckon we should all read our classes out of the database. It'll be so much easier. Well if we're gonna dump the part of it into a database anyway, we might as well dump all the fields we want into the database, calculate everything from there. Then we don't even have to worry that much about the underlying X_M_L_ representation. We can just query it. Well if we're gonna do that, we should try and store everything in in an X_M_L_ format as well. Yeah. Yeah. Well we don't even need to do that, 'cause if we got our information density calculated off-line, so all we do is treat the whole lot as one massive document. I mean they'll it's not gonna be so big that we can't load in a information density for every utterance. And we can just summarise based on that. I think you can do it on-line. I don't think there's really much point in doing like that when it's just gonna feed off in the end the information density measure basically. And that's all calculated off-line. So what you're really doing is sorting a list, is the p computationally hard part of it. Well like the ideas we're calculating are information density all off-line first for every utterance in the whole corpus, right? So what you do is you say if you're looking at a series of meetings, you just say well our whole document comprises of all these stuck together. And then all you have to do is sort them by j information density. Like maybe weighted with the search terms, and then extract them. I don't think it's too slow to do on-line, to be honest. Is that Yeah. Well, on the utterance level I was thinking. 
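Note on doing the summarisation on-line by sorting utterances on a precomputed information density, possibly weighted by the search terms, as suggested above: a minimal Python sketch. The multiplicative `boost` is an assumed weighting scheme, chosen only to make the idea concrete.

```python
def rank_utterances(utterances, query_terms, boost=2.0):
    """Sort utterances by density, favouring those that match the search.

    utterances: list of (text, density) pairs, density computed off-line.
    query_terms: words the user typed into the search.
    boost: assumed multiplier applied when an utterance contains a query term.
    """
    query = {term.lower() for term in query_terms}

    def score(item):
        text, density = item
        hit = any(word in query for word in text.lower().split())
        return density * (boost if hit else 1.0)

    return sorted(utterances, key=score, reverse=True)
```

The only on-line cost is the sort itself, which matches the point that the hard work is all done off-line.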
So the utterances with the highest like mean information density. Well the trouble with doing it on the word level is if you want the audio to synch up, you've got no way of getting in and extracting just that word. I mean it's impossible. For every single word? Oh, okay. Yeah. I don't think that will do it. We'll have to buffer it. Well the skimming's gonna use the importance. But like at first it's just gonna be I_D_F_. Well mostly skimming, yeah. Yeah. Well the nice thing about that is it will automatically be in sentences. Well more or less. So it will make more sense, and if you get just extract words. Yeah. I see it. But it'll need to be calculated at word level though because otherwise there won't be enough occurrences of the terms to make any meaningful sense. Yeah. Yeah, I reckon you can just mean it over the sentence. I think we should filter them. Maybe we should have like um a cut-off. So it a w word only gets a value if it's above a certain threshold. So anything that has less than say nought point five importance gets assigned to zero. Yeah, that's the other th Yeah. I think we'll have to buffer the audio. But I don't think it will be very hard. I think it would be like an hour or two's work. Like just build an another f wave file essentially. Yeah, I mean I bet there would be packages In memory, yeah. So just like unp there's bound to be like a media wave object or something like that. And just build one in memory. I don't know. I have no idea. But it must have like classes for dealing with files. And if it has classes for concatenating files, you can do it in memory. So Well what I think I might try and build is basically a class that you just feed it a linked list of um different wave-forms, and it will just string them all together with maybe, I don't know, tenth of a second silence in between each one or something like that. Normalise it, yeah. Oh yeah, yeah, we'll need that. We also really wanna be able to search by who's speaking as well. 
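Note on the audio idea above (string a list of wave-forms together in memory with about a tenth of a second of silence between them): a minimal Python sketch using the standard `wave` module. It assumes mono PCM clips that all share one sample rate and sample width; normalisation is left out.

```python
import io
import wave


def join_clips(clips, framerate=16000, sampwidth=2, gap_seconds=0.1):
    """Concatenate raw PCM clips into one in-memory WAV file.

    clips: list of bytes objects holding mono PCM samples.
    Returns the complete WAV file as bytes.
    """
    # Zero-valued samples give silence for signed 16-bit audio.
    silence = b"\x00" * (int(framerate * gap_seconds) * sampwidth)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(sampwidth)
        out.setframerate(framerate)
        out.writeframes(silence.join(clips))
    return buffer.getvalue()
```

The result can be handed to a player or written to disk unchanged, so nothing needs to touch the original audio files.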
It doesn't matter, 'cause all the calculation's done off-line. That's easy. You just like create a new X_M_L_ document in memory. I don't think it's really that much of a problem because if it's too big, what we can do is just well all the off-line stuff doesn't really matter. And all we can do is just process a bit at a time. Like for summarisation, say we wanted a hundred utterances in the summary, just look at the meeting, take the top one hundred utterances in each other meeting. If it scores higher than the ones already in the summary so far, just replace them. And then you only have to process one meeting at a time. Okay, so maybe we should build a b store a mean measure for the segments and meetings as well? And speaker. Speaker and um topic segmenting we'll need as well. Yeah. Well yeah, and then it'll f preserve the order when it's displayed the Yeah. Yeah. Yeah, I think so. So we should basically make our own X_M_L_ document in memory that everyone's um module changes that, rather than the underlying data. And then have that X_M_L_ uh NITE X_M_L_ document tied to the interface. Well, you can make it in a file if you want. Mm-hmm. They are utterances, aren't they? The segments are utterances, aren't they? Yeah. Alright, okay. Well, that's easy. Well it's close enough, isn't it? It may not be exact every time, but it's a so sort of size we're looking for. Yeah, yeah. Yeah. But why don't we just write it as a new X_M_L_ file? Can NITE handle just loading arbitrary uh new like attributes and stuff? I mean, I would have thought they'd make it able to. Yeah. So why do we need to have two X_M_L_ trees in memory at once? The other thing is that would mean we'd be using their parser as well, which means we wouldn't have to parse anything, which be quite nice. 'Cause their parser is probably much faster than anything we've come up with anyway. Yeah, I mean we can process it in chunks if it gets too big basically. 
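Note on the "one meeting at a time" summarisation described above (keep the best hundred utterances so far and replace them when a later meeting scores higher): a minimal Python sketch using a bounded min-heap. The tuple layout is an assumption for illustration.

```python
import heapq


def top_utterances(meetings, n):
    """Keep the n highest-scoring utterances while reading one meeting at a time.

    meetings: iterable of lists of (score, meeting_id, index, text) tuples.
    Returns the winners in meeting and time order, ready for display.
    """
    best = []  # min-heap: best[0] is the weakest utterance kept so far
    for meeting in meetings:
        for item in meeting:
            if len(best) < n:
                heapq.heappush(best, item)
            elif item > best[0]:
                heapq.heapreplace(best, item)
    return sorted(best, key=lambda item: (item[1], item[2]))
```

Only one meeting plus n utterances are in memory at any moment, and the final sort restores the original order for display.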
We can just process it all in chunks if it gets too big to load it into memory. I think we probably want to store Sorry. I think we probably want to store um a hierarchical information density as well. So like an informan mation density score for each meeting and each topic segment. 'Cause otherwise we'd be recalculating the same thing over and over and over again. Yeah. And that will obviously make it much easier to display. Well it may not for the whole meeting, but like Yeah, exactly. Yeah. Well, we can start off like that. Well I was gonna start off I've v got sort of half-way through implementing one that does just I_D_F_. And then just I can change that to work on whatever. Yeah. And it should be weighted by stuff like the hot spots and um the key-words in the search and stuff like that. Did he not say something about named entities? So I thought he said there wasn't very many. Yeah. Yeah. It's not T_F_I_D_F_, it's just inverse document frequency. 'Cause it's really easy to do basically. There's just like for a baseline really. Well, I'm half-way through. It's not working yet, but it will do. Um yeah. And then averaging it over the utterances. But it's not like um related to the corpus at all. It's just working on an arbitrary text file at the moment. No. It would be useful to know how everyone's gonna store their things though. Yeah. Yeah. Well I've got like a few hours free. Like after this. It's the most boring task. Yeah. Or at least um simple versions of them. So maybe we should try doing something really simple, like just displaying a whole meeting. And like just being able to scroll through it or something like that. Yeah. Are you free after this? How about Friday then. 'Cause I'm off all Friday. Uh Wednesday I've got a nine 'til twelve. Yeah, nothing in the afternoon. I've got nothing in the afternoon. So Okay. So you ha yeah. Where about, just in Appleton Tower? Uh I'll be in um the Appleton Tower anyway. Um well I'll be there from twelve. 
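Note on the inverse document frequency baseline mentioned above: a minimal Python sketch of plain IDF, using the common definition log(N / df). What counts as a document (meeting, topic segment, utterance) is left to the caller, and the half-finished implementation referred to in the meeting may well differ.

```python
import math


def inverse_document_frequency(documents):
    """Compute IDF for every term in a collection.

    documents: list of token lists, one per document.
    Returns a dict mapping each term to log(N / number of documents containing it).
    """
    n = len(documents)
    document_count = {}
    for doc in documents:
        for term in set(doc):
            document_count[term] = document_count.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in document_count.items()}
```

A word that appears in every document gets a value of zero, so very common words fall out without a separate stop list.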
I've got some other stuff that needs done on Matlab, so if you're not there at twelve, I can just work on that. So Yeah. Why w Yeah. I'm just building a dictionary. Oh, mine's just gonna use the um hash map one in um Java. 'Cause I'm only gonna do it on small documents. It's just like bef until the information density is up and running. Just something to get give me something to work with. So it's only gonna use quite small documents, you see, to start with. Why does it need to be classified into like different segments? Can we just fill a second class with junk that we don't care about? Like, I don't know, copies of Shakespeare or something. 'Cause if what we're looking for is the um frequency statistics, I don't see how that would be changed by the classification. I the Well there maybe another tool available? Yeah. Um I can't remember who's got it. Might be WordNet. But one of these big corpuses has a list of stop words that you can download and they're just basically lists of really uninteresting boring words that we could filter out before we do that. It's like that's one the papers I read, that's um one things they did right at the beginning is they've got this big s stop-list and they just ignore all of those throughout the experiment. Yeah, I it would be useful for me as well. It uh I think that'd be useful for me as well. Yeah. Yeah. Well all you really wanna do is look into getting some sub-set of the ICSI corpus off the DICE machines. 'Cause I hate working on DICE. It's awful. Like so I can use my home machine. ha has a C_D_ burner though. has a C_D_ burner. Yeah. The right-hand corner, far right. Yeah. How big is it without um the WAV files and stuff? 'Cause I could just say at um going over S_C_P_ one night and just leave it going all night if I had to. It's yeah, I mean the wave data are obviously not gonna get off there completely. Really? Oh right? I'll see if I can S_C_P_ it, I suppose. I've got a Linux box and a Windows box. So Broad-band. 
Put it on to C_D_. I can if I get down I can put to C_D_. Yeah. I'm not sure if there's enough space. Is how much do we get? Really? Okay. Yeah, but I can do it from that session, can't I? You can compress it from a remote session and S_C_P_ it from the same session? Do you think? Yeah. Oh no no, I was thinking of SSHing just into some machine and then just SCPing it from there. Yeah. I mean it has to go through the gateway. But Can you not do that? Mm, I see. Yeah. So you could just But th first, uh how big are the chunks? How big are the chunks you're looking at? So quite small then. So you could just um you could use just the same thing we used to build the big dictionary. You just do that on-line 'cause that won't take long to build a little dictionary that big, will it. I mean just use the same tool that we use. Yeah. Yeah. It doesn't need ordered, no. Um well that's the t are you using T_F_I_D_F_ for the information density? Alright, okay. Like 'cause frequency would be useful, I think. But um depending on the context, the size, and what we consider a document in the sense of calculating T_F_I_D_F_ is gonna change. Which might need thinking about. I think it would be useful, yeah. Well you need the raw frequency as well. But um you also need how many times things occur within each document. And um what we consider a document's gonna depend on our context, I think. 'Cause if we're looking at the whole lot of meetings, we'll consider each meeting a document in sort of terms of this algorithm. And if we're viewing like say just a small topic segment you might look at even each utterance as a small document. Yeah, but the thing is um It's gonna need some th th thought of how we Actually maybe it doesn't actually matter. Maybe if you just do it once at the highest level, it it will be fine. But I was just thinking it might be difficult to calculate the T_F_I_D_F_ off-line for all the different levels we might want. 
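Note on the point above that what counts as a "document" changes the T_F_I_D_F_ values: a minimal Python sketch in which the caller decides the granularity by how the token lists are grouped. The formula (raw term count times log(N / df)) is one common variant, assumed here for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute a TF-IDF weight per term for each document.

    documents: list of token lists; each list is one "document" at whatever
    level the caller picks (utterance, topic segment, meeting).
    Returns one {term: weight} dict per document, in the same order.
    """
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in documents
    ]
```

Feeding the same words in as four utterances or as two meetings gives different weights, which is the concern raised above about precomputing it off-line for every level.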
'Cause if we're gonna allow disjoint segments for example, then how are we gonna know what's gonna be in context at any given time? But I suppose if you just did it globally, treating a meeting as a document, it'd probably still be work out fine, because you'd only be comparing to ones within the context. Uh I don't know, I thought were you gonna use that in the end? The information density. Oh sorry, that's what I mean. Like um yeah, for each word or whatever, but across the whole lot is what I mean by highest level. Like across the whole corpus. Yeah, but you'd probably look at each meeting as a document. Mm possibly. Are they big enough to get anything meaningful out of? Well yeah, that is not it's not an issue. You just concatenate an X_M_L_ file together. but we still want to have like a notion of meetings for the user. Yeah, sure. Yeah, you just like whatever you want to look at, you just jam together into an X_M_L_ file and that's your meeting, even though bits of it may come from all over the place or whatever. I mean I don't see why that's really a big problem. So basically what you're saying is you can take an arbitrary amount of data and process it with the same algorithm. It doesn't matter conceptually what that data is. It could be a meeting. it could be two utterances. it could be a meeting plus half a meeting from somewhere else. I don't think it's very difficult though. I mean what you do is you just build an X_M_L_ file, and if you want it to get down to the utterances, you'd go to the leaves. And then if you wanted the next level up, you'd go to the parents of those and like just go from like the leaves inwards towards the branch to build up things like um you know, when you click on a segment, it's gonna have like words or whatever that are important. As long as like the algorithms are designed um with it in mind, I don't think it's a very big problem. 
Well like say you had um like say for a meeting, right, you've got like uh say a hierarchy that looks quite big, like this. And like the utterances come off of here maybe. Then when whatever your algorithm is doing, as long as when you're working with utterances, you go for all the leaves, like then if you need something next up, so like a topic segment, you'd go to here. But if you were looking at say this one, so only went like this. Right, so you it's same, you'd start with the leaves, and you go oh, I want a topic segment. So I go one layer up. See, and then if you're working with just a topic segment in there, it's the only thing you have to worry about. And like each time you want a higher level, you just need to go up the tree. And as long as your algorithm respects that, then we can just process any arbitrary X_M_L_ file with whatever hierarchical structure we want. A meeting, say, and that would be a topic segment. So I think as long as you build an algorithm that respects whatever structure's in the file, rather than imposing its own structure Well no, it doesn't have to be. But I mean it could be as many nodes as you want. Like this one could be deeper maybe, say. So then you'd start with all your utterances here, and when you go up to get topic segments, you go to here here here here here here here. That might be a bit confusing though 'cause you have things on different levels. Well Wednesday. Yeah. Yeah. So we'll see if we can get like a mini-browser just displays two things synched together of some kind. Yeah. Yeah. It'd be useful. I don't know who you see about that though. I d have no idea. I've probably got a reasonable amount because um everything on my DICE account can actually be deleted 'cause I store it all at home as well. Is that guaranteed to stay, the Maybe you should send a support form. Just say we want some web space. Listen to. Yeah. 'Cause that'd be really useful is if we had a big directory. Especially for transferring stuff. 
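Note on the tree idea sketched on the whiteboard above (start from the leaves for utterances, go one layer up for topic segments, and so on): a minimal Python sketch over an arbitrary XML tree. The function name and the tag names in the usage are hypothetical; `ElementTree` keeps no parent links, so the sketch builds its own parent map.

```python
import xml.etree.ElementTree as ET


def ancestors_of_leaves(root, levels_up):
    """Return the distinct nodes found by walking up from every leaf.

    levels_up = 0 gives the leaves themselves (utterances in the example),
    1 gives their parents (topic segments), and so on. Walking stops at the root.
    """
    parent = {child: node for node in root.iter() for child in node}
    leaves = [node for node in root.iter() if len(node) == 0]
    found = []
    for leaf in leaves:
        node = leaf
        for _ in range(levels_up):
            node = parent.get(node, node)
        if node not in found:
            found.append(node)
    return found
```

As noted in the discussion, if branches have different depths then "one layer up" lands on different kinds of node, so an algorithm using this would have to tolerate that.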
Having said that, are we allowed to take a copy of the ICSI corpus? Something we should probably ask before we do it. Okay. Okay. No, me neither. Might be funny to see what it summarised the whole corpus as anyway. I think it'd be very useful. But we can just change the code. Is that it? That's quite good. Yeah. I could just use it with the frequency, I think, until the information density thing's finished. That would be really useful. If you're doing it in Java, could you um serialize the dictionary, yeah, as well as writing it to a file? It's really easy. I don't see why it'd be any more massive than the file. Yeah. It just saves you parsing the um file representation of it. 'Cause I would be using it in Java anyway. So I'd just be building the data structure again. Yeah, but it seems like a bit silly to be parsing it over and over again kinda thing. I think all the collections and things implement Serializable already. I think they might do. Tonight I'll either work some more on uh the T_F_I_D_F_ summarizer or do the audio thing. Yeah. Do we have to demonstrate something next week? Yeah. Yeah, I know. I think it's 'cause we had to specify it ourselves that it's not as um focused as the specification of most um work we have to do. Yeah. Once we start doing it it will all become more or less obvious I think anyway.
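The request to serialize the dictionary as well as writing it to a file is about skipping the re-parse each time the data structure is needed. The speakers talk about Java serialization; the sketch below swaps in Python's `pickle` module as the closest analogue, and the function names and sample text are made up for illustration.

```python
import os
import pickle
import tempfile
from collections import Counter

def build_dictionary(text):
    """Parse raw text into word-form counts (the step we only want to run once)."""
    return Counter(text.lower().split())

def save_dictionary(counts, path):
    """Write the in-memory structure to disk in binary form."""
    with open(path, "wb") as handle:
        pickle.dump(counts, handle)

def load_dictionary(path):
    """Read the structure back without re-parsing any text."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

counts = build_dictionary("the meeting is about the browser")
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "dictionary.pkl")
    save_dictionary(counts, path)
    restored = load_dictionary(path)
print(restored == counts)
```

A pickled file is only readable from Python and should only be loaded from a trusted source, so keeping the plain-text version alongside it, as suggested in the discussion, still makes sense.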
'Kay. Gosh. 'Kay. Is there much more in it than he said yesterday? Mm. Hmm. Hmm? Yeah, now I'd say for the prototype if we just like wherever possible chunk in the stuff that we have um pre-annotated and stuff, and for the stuff that we don't have pre-annotated write like a stupid baseline, then we should probably be able to basically that means we focus on the interface first sort of, so that we take the ready-made parts and just see how we get them to work together in the interface the way we want and then we have a working prototype. And then we can go back and replace pieces either by our own components or by more sophisticated components of our own. So it's probably feasible. The thing is I'm away this weekend. So that's for me Oh yeah, um yeah. No. But also I might like the similarity thing, like just my matrix itself for my stuff, I think I can do that fairly quickly because I have the algorithms. Yeah, I think today's meeting is really the one where we sort of settle down the data structure and as soon as we have that, uh probably like after today's meeting, we then actually need to well go back first of all and look at NITE X_M_L_ to see how far that which we want is compatible with that which NITE X_M_L_ offers us. And then just sort of make sure everyone understands the interface. So I think if today we decide on what data we wanna have now, and later, maybe even today, we go and look at NITE X_M_L_ or some of us look at NITE X_M_L_ in a bit more detail, just trying to make some sense of that code and see how does the representation work in their system. And then sort of with that knowledge we should be able to then say okay, that type of NITE X_M_L_ data we wanna load into it, and this is how everyone can access it, and then we should be able to go from there. No.
I've looked looked at the documentation and n like seen enough to make me think that we want to use the NITE X_M_L_ framework because um they have a good a event model that synchronizes sort of the data and and every display element. So that takes a lot of work away from us. Sort of that would be a reason for staying within their framework and using their general classes. But beyond that I haven't looked at it at all, which is something we should really do. Who actually like for this whole discussion I mean, who of us is doing stuff that is happening on-line and who of us is doing stuff that's happening off-line? Like my data is coming c Hmm? Yeah. Okay. Okay. 'Kay. So basically apart from the display module, the i the display itself, we don't have an extremely high degree of interaction between sort of our modules that create the stuff and and the interface, so the interface is mainly while it's running just working on data that's just loaded from a file, I guess. There isn't Yeah, I know. Th Yeah, the search is I guess the search is sort of a strange beast anyway because for the search we're leaving the NITE X_M_L_ framework. Um but that's still sort of that's good. That means that at least like we don't have the type of situation where somebody has to do like a billion calculations on on data on-line. 'Cause that would make it a lot more like that would mean that our interface for the data would have to be a lot more careful about how it performs and and everything. And nobody is modifying that data at at on-line time at all it seems. Nobody's making any changes to the actual data on-line. So that's actually making it a lot easier. That basically means our browser really is a viewer mostly, which isn't doing much with the data except for sort of selecting a piece piece of it and and displaying it. Hmm? Well some parts relevant for the search, yes. I'd say so. Hmm? Yeah, but nobody of us is doing much of searching from the data in the on-line stage. 
And for all together, like the display itself, I think we are easier if we if it's sitting on the X_M_L_ than if it's sitting on the S_Q_L_ stuff, because if it's sitting on the X_M_L_, we have the the NITE X_M_L_ framework with all its functionality for synchronizing through all the different levels whenever there's a change, whenever something's moving forward and stuff. And we can just more or less look at their code, like how their player moves forward, and how that moving forward is represented in different windows and stuff. So I think in the actual browser itself I don't wanna sit on the S_Q_L_ if we can sit on the X_M_L_ because sitting on the X_M_L_ we have all we have so much help. And for y for like the p the calculations that we're doing apart from the search, it seems that everyone needs some special representations anyway. You mean our results? Yeah, in in the NITE X_M_L_ X_M_L_ format, so with their time stamps and stuff, so that it's easy to to tie together st things. What I'm like what we have to think about is if we go with this multi-level idea, like this idea that sort of if you start with a whole meeting series as one entity, as one thing that you display, as one whole sort of, that then the individual chunks of the individual meetings, whereas and then you can click on a meeting, and then sort of the meeting is the whole thing and the chunks are the individual segments, that means sort of we have multiple levels of of representation, which we probably If we if we do it this way like we f we have to discuss that if we do it this way, then we should probably find some abstraction model, so that the interface in the sense like deals with it as if it's same so that the interface doesn't really have to worry whether it's a meeting in the whole meeting series or a segment within a meeting, you know what I mean? And that's probably stuff that we have to sort of like process twice then. 
Like for example that like the summary of a meeting within the whole meeting corpus or meeting series y is meeting series a good word for that? I don't really know what how to call it. You know what I mean, like not not the whole corpus, but every meeting that has to do with one topic. Um so in in the meeting se series so that a summary for a meeting within the meeting series, are sort of compiled off-line by a summary module. And that is separate from a summary of a segment within a meeting. 'Cause I don't think we can So are we doing that at all levels? Are we um And just have different like fine-grainedness levels sort of. Mm. 'Kay. So the only thing that yeah, so the only thing that would happen basically if I double-click let's say from the whole meeting series on a single meeting, is that the zoom level changes. Like the th the start and the end position changes and the zoom level changes. I I thought we couldn't do that. Like I was under the impression that we couldn't do that because we couldn't load the data for all that. But I don't know, I mean that So I'm s not sure if I got it. I was Mm-hmm. Mm-hmm. Mm-hmm. Mm-hmm. Okay. So Okay. I wa I was just worried about the total memory complexity of it. But I I completely admit, I mean, I just sort of like th took that from some thing that Jonathan once said about not loading everything. But maybe I was just wrong about it. How many utterances w Yeah, and I w yeah. Yeah. Yeah. Yeah. So what we have is we would have a word. Like we would have words with some priority levels. And they would basically be because even the selection would would the summaries automatically feed from just how prioritized an individual word or how indiv uh prioritized an individual utterance is? Or i are the summaries sort of refined from it and made by a machine to make sentences and stuff? Or are they just sort of taking out the words with the highest priority and then the words of the second highest priority? And the u okay. 
Are we doing it on th the whole thing on the utterance level? Or are we doing it on word level, like the information density calculation? We I think we have start and end times for words actually, but it's yeah, but it m it might s but it might sound crazy in the player. We should really maybe we can do that together at some point today that we check out how the player works. But there's maybe some merit in altogether doing it on an utterance level in the end. So Yeah. Well but also about the displays, I mean the displays in the in the text body, in the in the latest draft that we had sort of we came up with the idea that it isn't displaying utterance for utterance, but it's also displaying uh a summarised version in you know, like below the below the graph, the part. Maybe Yeah, r Hmm? Oh yeah, f it's just like there there's like audio skimming and there's displayed skimming. Yeah. Ma maybe there's some merit of going altogether for utterance level and not even bother to calculate I mean if you have to do it internally, then you can do it. But maybe like not even store the importance levels for individual words and just sort of rank utterances as a whole. Hmm? Yeah. 'Cause it it might be better skimming and less memory required at the same time. And I mean if you if you know how to do it for individual words, then you can just in the worst case, if you can't find anything else, just sort of make the mean of the words over the utterance. You know what I mean? W it's it's Well what's the smallest chunk at the moment you're thinking of of assigning an importance measure to, is it a word or is it an utterance? So we're thinking of like maybe just storing it on a per utterance level. Because it's it's less stuff to store probably for Dave in the in the audio playing. And for in the display it's probably better if you have whole utterances than I don't know, like what it's like if you just take single words out of utterances. 
That probably doesn't make any sense at all, whereas if you just uh show important utterances but the utterance as a whole it makes more sense. So it doesn't actually make a difference for your algorithm, 'cause it just means that if you're working on a word level, then we just mean it over the utterance. They are on Oh so that's good anyway then, yeah. Because that makes it a lot easier than to t put it on utterance level. Oh yeah. No but I mean like how how Jasmine does it internally I don't know, but it's probably, yeah, you probably have to work on word levels for importance. But there should be ways of easily going from a word level to an utterance level. Okay. Yeah, prob Hmm. Well we do a pre-filtering of sort of the whole thing, sort of like but that, like the problem with that is it's easy to do in the text level. But that would mean it would still play the uh in your audio, unless we sort of also store what pieces we cut out for the audio. Yeah. I think before we can like answer that specific question how we c deal with that, it's probably good for us to look at what the audio player is capable of doing. Yes. So what do you mean by buffering? Like you think directly feeding But yeah, but not but not stored on the hard disk and then loaded in, but loaded in directly from memory. But it's probably a stream if it exists in Java, it would be probably some binary stream going in of some type. Okay, yeah. Okay. Okay, so I mean so that means that there's probably, even if you go on an per utterance level, there's still some merit on within utterances cutting out stuff which clearly isn't relevant at all, and that maybe also for the audio we'd have to do. So let's say we play the whole au phrase, but then in addition to that, we have some information that says minus that part of something. That's okay, that we can do. Yeah, maybe even I mean that's sort of that depends on how how advanced we get. 
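Going from word-level importance to a single utterance-level score, as discussed above, can be as simple as taking the mean over the words of the utterance. A minimal sketch, assuming the word scores already exist as a mapping; the function name, the `default` value for unseen words, and the sample scores are all hypothetical.

```python
from statistics import mean

def utterance_score(words, word_scores, default=0.0):
    """Average the per-word importance over one utterance.

    Words missing from word_scores count as `default`; an empty
    utterance also gets `default`.
    """
    if not words:
        return default
    return mean(word_scores.get(word, default) for word in words)

word_scores = {"browser": 3.0, "prototype": 1.0}
print(utterance_score(["browser", "prototype"], word_scores))
print(utterance_score(["um", "browser"], word_scores))
```

Storing only this one number per utterance is what keeps the memory footprint small; anything finer, such as cutting irrelevant stretches out of an utterance, would need extra information stored alongside it.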
Maybe if we realise that there's massive differences in gain or in something, you can probably just make some simple normalization, but that really depends on how much time we have and how much is necessary. Yeah, I don't know anything about audio and I have never seen the player. So if you find that the player accepts some input from memory, and if it's easy to do, then I guess that's fairly doable. So but that means in the general structure we're actually quite lucky, so we load into memory for the whole series of meetings just the utterances and rankings for the utterances and some information probably that says, well, I guess that goes with the utterance, who's speaking. Because then we can also do the display about who's speaking. Yeah. But I'm still confused 'cause I thought like that's just what Jonathan said that we can't do, like load a massive document of that size. On the other hand, I mean it should be like fifty mega-byte in RAM or something, it shouldn't be massive, should it? Actually fifty, hundred megabyte is quite big in RAM. Just thinking, what's the simp so We do get an error message with the project if we load everything into the project with all the data they load. So we know that doesn't work. So our hope is essentially that we load less into it. What's this lazy loading thing, somebody explain lazy loading to me. Ah, okay. So that is only by type of file. Like if the same thing is in different files, would it then maybe like, you know, if utterances are split over three or ten or a hundred different files, is there then a chance maybe that it doesn't try to load them all into memory at the same time, but just So why does it fail then in the first place? Then it shouldn't ever fail, because then it should never Yeah, but um it failed right when you load it, right, the NITE X_M_L_ kit, so that's interesting. Hmm.
Let's check that out. Um I'll p I'll probably ask Jonathan about it. So alternatively, if we realise we can't do the whole thing in one go, we can probably just process some sort of meta-data, you know what I mean, like sort of sort of for the whole series chunks representing the individual meetings or some Like something that represents the whole series in in a v in a structure very similar to the structure in which we represent individual um meetings, but with data sort of always combined from the whole series. so instead of having an single utterance that we display, it would probably be like that would be representing a whole um topic, a segment in a meeting. And sort of so that wi using the same data st Well, in a sense Uh I'm I'm thinking of in a sense of like creating a virtual a virtual meeting out of the whole meeting series, sort of. Yeah, sort of like off-line create a virtual meeting, which which basically treats the meeting series as if it was a meeting, and treats the individual meetings within the series as if they were segments, and treats the individual segments within meetings as if they were um utterances. You know, so we just sort of we shift it one level up. And in that way we could probably use the same algorithm and just like make vir like one or two ifs that say okay, if you are on a whole document uh a whole series level and that was a double-click, then don't just go into that um segment, but load a new file or something like it, but in general use the same algorithm. That would be an alternative if we can't actually load the whole thing and 'Cause also like even if we maybe this whole like maybe I'm worrying too much about the whole series in one thing display, because actually I mean probably users wouldn't view that one too often. Yeah, but I'm I'm still worried. 
Like for example for the display, if you actually if you want a display uh like for the whole series, the information density levels based on and and the f and the only granularity you have is individual utterances, that means you have to through every single utterance in a series of seventy hours of meetings. Yeah. Yeah, and if you make that structurally very similar to the the le like one level down, like the way how we uh store individual utterances and stuff, then maybe we can more or less use the same code and just make a few ifs and stuff. Yeah, so so but still so in in general we're having we're having utterances and they have a score. And that's as much as we really need. And of cou and they also have a time a time information of course. Hmm? And a and a s and a speaker information, yeah. Yeah, so an information which topic they're in, yeah. And and probably separate to that an information about the different topics like that Yeah. So so the skimming can work on that because the skimming just sort of sorts the utterances and puts as many in as it needs Yeah. Yeah, it'll it'll play them in some order in which they were set because otherwise it's gonna be more entertaining. Um but that that's enough data for the skimming and the the searching, so what the searching does is the searching leaves the whole framework, goes to the S_Q_L_ database and gets like basically in the end gets just a time marker for where that is, like that utterance that we are concerned with. And then we have to find I'm sure there's some way in in NITE X_M_L_ to just say set position to that time mark. And then it shifts the whole frame and it alerts every single element of the display and the display updates. Yeah, yeah. That we can ju yeah, but so so if if somethi so yeah. 
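The skimming described above needs only a score, times, and a speaker per utterance: rank by score, take as many as fit, then play them in their original order. The sketch below is one way that could look; the `Utterance` fields, the greedy time-budget selection, and the sample data are assumptions for illustration, not something fixed in the discussion.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    uid: str
    start: float   # seconds from the start of the meeting
    end: float
    speaker: str
    score: float   # importance weight computed off-line

def skim(utterances, budget_seconds):
    """Pick the highest-scoring utterances that fit the time budget,
    then return them in chronological order for playback."""
    chosen, used = [], 0.0
    for utt in sorted(utterances, key=lambda u: u.score, reverse=True):
        length = utt.end - utt.start
        if used + length <= budget_seconds:
            chosen.append(utt)
            used += length
    return sorted(chosen, key=lambda u: u.start)

meeting = [
    Utterance("u1", 0.0, 10.0, "A", 0.2),
    Utterance("u2", 10.0, 15.0, "B", 0.9),
    Utterance("u3", 15.0, 30.0, "A", 0.5),
    Utterance("u4", 30.0, 34.0, "C", 0.7),
]
print([u.uid for u in skim(meeting, budget_seconds=10)])  # ['u2', 'u4']
```

A search hit fits the same data: it only has to hand back the `start` of the matching utterance, which is the time marker the display is then asked to jump to.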
So if in that tree display somebody clicks on something Yeah, and then you sort of feed the time stamp to and the NITE X_M_L_ central manager, and that central manager alerts everything that's there, like alerts the skim like the the audio display, alerts the text display, alerts the visual display and says we have a new time frame and then they all sort of do their update routines with respect to the current level of zoom. So how much do they display, and starting position at where the or maybe the mid-position of it, I don't know, like w if start where the thing was found or if that thing wa was found it's in the middle of the part that we display, that I don't know. But that we can decide about, but a general sort of It's the same thing if like whether you play and it moves forward or whether you jump to a position through search, it's essentially for all the window handling, it's the same event. It's only that the event gets triggered by the search routine which sort of push that into NITE X_M_L_ and says please go there now. Why do we have to do it in memory? But that stuff's so I mean like the information is coming from off-line. So we probably we don't even have to change the utterance document, right, because the whole way, like the whole beauty of the NITE X_M_L_ is that it ties together lots of different files. So we can just create an additional X_M_L_ file which for every utterance like the utterances have I_D_s I presume, some references. So we just we tie uh p just a very short X_M_L_ file, which it's the only information it has that has whatever a number for for the um weight, for the information density, and we just tie that to the existing utterances and tie them to the existing speaker changes. Well otherwise we probably have to go over it and like add some integer that we just increment from top to bottom sort of to every utterance as an as an I_D_ some type. 
Or un or try to understand how NITE X_M_L_ I_D_s work and maybe there's some special route we have to follow when we use these I_D_s. It's alm hmm? Yeah, the the girl said the utterances themselves are not numbered at the moment. Okay. Okay. Okay. Yeah. So I guess that would be solvable if not. Mm-hmm. Sorry? Okay. Okay. Is that a board marker pen actually? Oh. That's just so like to make a list of all this stuff, or we probably can somebody can do it on paper. All these fancy pens. So what so the stuff we have we have utterances and speakers and weights for utterances. So for for every utterance sort of like the utterance has a speaker and a weight which is coming from outside. Or we just tie it to it. And there is segments, which hmm? Oh, so sorry um. Uh topic s topic segments I meant. Like they are they are a super-unit. So so the utterances are tied to topic segments. And if the time stamps are on a word level, then we b somehow have to extract time stamps for utterances where they start. W what segments now? Okay. Is the uh is that the same as utterances that is that the same as utterances that Mm-hmm. Mm-hmm. What so that's Oh. But that's one o one segment or is that two segments then? Yeah. Okay. Okay. So but but generally utterances is that which we just called uh sorry, segments is that which we just called utterances now. Like it's it's the sa it's sort of like one person's contribution at a time sort of thingy dingy. Okay, so yeah, so we have those, and and then we have some f field somewhere else which has topics. Yeah, and and a topic's basically they are just on the I_D_, probably with a start time or something, and and the utterances referenced to those topics I guess. So the topics don't contain any redundant thing of like showing the whole topic again, but they just sort of say a number and where they start and where they finish. And the utterances then say which topic they belong to. Yeah. No. 
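The structure settled on here (utterances carrying a speaker and a topic reference, topics holding just an ID with start and end, and weights in a separate short file tied to utterances by ID) can be illustrated as follows. Everything in this sketch is hypothetical: the element names, the `ref` and `value` attributes, and the default weight are invented, and real NITE XML IDs and cross-file links may work differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical utterance file: each utterance points at its topic by ID.
UTTERANCES = """<utterances>
  <utt id="u1" speaker="A" topic="t1" start="0.0" end="4.2"/>
  <utt id="u2" speaker="B" topic="t1" start="4.2" end="9.0"/>
  <utt id="u3" speaker="A" topic="t2" start="9.0" end="12.5"/>
</utterances>"""

# Hypothetical stand-off file: nothing but a weight per utterance ID.
WEIGHTS = """<weights>
  <weight ref="u1" value="0.8"/>
  <weight ref="u3" value="0.3"/>
</weights>"""

def attach_weights(utterance_xml, weight_xml, default=0.0):
    """Join the stand-off weights onto the utterances by ID."""
    weights = {
        node.get("ref"): float(node.get("value"))
        for node in ET.fromstring(weight_xml).iter("weight")
    }
    merged = []
    for utt in ET.fromstring(utterance_xml).iter("utt"):
        merged.append({
            "id": utt.get("id"),
            "speaker": utt.get("speaker"),
            "topic": utt.get("topic"),
            "start": float(utt.get("start")),
            "end": float(utt.get("end")),
            "weight": weights.get(utt.get("id"), default),
        })
    return merged

for row in attach_weights(UTTERANCES, WEIGHTS):
    print(row["id"], row["speaker"], row["topic"], row["weight"])
```

The utterance file is never modified; if the utterances turn out to have no IDs, this join is exactly the step that would need them added first.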
But I was thinking of the topic segmentation now and and f for that there would only be one, right, because it's sort of like it's just a time window. Yeah. So if this lazy loading works, then this should definitely fit into I mean not memory then because it wouldn't all be in memory at the same time. So if we just have those sort of that information like a long list of all the utterances slash segments and like short or smaller lists which give weight to them. And even though probably if there's a lot of over-head in having two different files, we can probably merge the weights into it off-line. You know what I mean, like if if there's a lot of bureaucracy involved with having two different trees and whether one ties to the other because the one has the weight for the other, then it's probably quicker to just Yeah, I thought that was the whole beauty that like you can just make a new X_M_L_ file and sort of tie that to the other and and it tre Oh yeah. So no, I didn't I didn't mean tree. No. No. I meant just like handling two different files internally. Sort of c I was just thinking you know like if if the overhead for having the same amount of data coming from two d files instead of from one file is massive then it would probably be for us easy to just like off-line put the the weight into into the file that has the segments, uh yeah, segments slash utterances already. But that we can figure out I mean if it's going horrendously wrong. Yeah. Yeah. Yeah, no, we'd we'd be completely using like the whole infrastructure and basically just I mean the main difference really between our project and theirs really is that we load a different part of the data. But otherwise we're doing it the same way that they are doing it. So we just we're sort of running different types of queries on it. 
We in a sense we I think we are running queries, it's not just about um what we load and what we don't load, but we're l running queries in the sense that we dynamically select by by weights, don't we? That we have to check how fast that is, like to say give us all the ones that whether that works with their query language, whether that's too many results and whether we shou You know, if 'cause if it i let's say I mean if if their query language is strange and if it would return b ten million results and it can't handle it, then we can just write our individual components in the way that they know which what the threshold is. So they still get all the data and just they internally say oh no, this is less than three and I'm not gonna display it or something. Hmm? Yeah. No. I'm just thinking for this whole thing of like a different level, sort of cutting out different different pieces, whether we do that through a query where we say give us everything that's ab above this and this weight, or whether we skip the same infrastructure, but every individual module like the player and the display say like they still get sort of all the different utterances, uh all the different pieces, but they say oh, this piece I leave out, because it's below the current threshold level. When do we need the one for the meet Okay. Yeah, I guess for the so when we have the display, will we display the whole series. Then if we have for the individual topic segments within the meetings if we have ready calculated disp um measures, then we don't have to sort of extract that data from the individual utterances. Yeah, and that's also fairly easy to store along with our segments, isn't it. For the segments, are we extracting some type of title for them that we craft with some fancy algorithm or manually or we're just taking the single most highly valued key-word utterance for the segment heading? Hmm. Hmm. 
It's probably like in in the end probably it wouldn't be the best thing if it's just the high most highly ranked phrase or key-word because like for example for an introduction that would most definitely not be anything that has any title anywhere similar to introduction or something. Yeah. Also like for this part, maybe if we go over it with named entity in the end, if I mean w if one of the people doing DIL has some named entity code to spare, and just like at least for the for sort of for finding topics, titles for for segments, just take a named entity which has a really high, what's it called, D_F_I_D_F_, whatever. 'Cause you'd probably be quite likely if they're talking about a conference or a person, that that would be a named entity which is very highly fr um frequented in that part. Yeah, he said they're quite sparse. So that basically was don't bother basing too much of your general calculation on it. But like especially if they're sparse, probably individual named entities which describe what a what a segment is about would probably be quite good. Like if there's some name of some conference, they would could probably say that name of the conference quite often, even though he's right that they make indirect references to it. Anyway Sorry? So you're doing that on a on a per word level. Okay. Okay. Okay, cool. I was just wondering where you had the corpus from at the moment. So it it seems that the data structure isn't a big problem and that basically we don't have to have all these massive discussions of how we exactly interact with the data structure because most of our work isn't done with that data structure in memory in the browser, but it's just done off-line and everyone can ha represent it anyway they want as long as they sort of store it in a useful X_M_L_ representation in the end. So like Yeah, that would mean understanding the NITE X_M_L_ X_M_L_ sort of format in a lot more detail. 
We should I think we should just have a long session in the computer room together and like now that we know a bit more what we want, take a closer look at NITE X_M_L_. Mm-hmm. Mm-hmm. Good. Yeah, I haven't looked at this stuff much at all. Yeah. Yeah. Who's who's sort of doing the the the central coordination of of of the browser application now? Like Hmm? Yeah, or but also like all these elements like like the loading and, yeah, integration and and like handling the data loading and stuff. Nah. I'm sort of like I think I'll take over the display, just because I've started with a bit and found it found it doable. So somebody should sort of be the one person who's who understands most about what's t centrally going on with with the with the project, like with the with the browser as a whole and where the data comes in and Any volunteers? It's also a complicated one. Yeah. I know but uh b I guess we can do it like several people together, it's probably just those people have to work together a lot and very closely and just make sure that they're always f understand what the other one is doing. Yeah, or or ready-made versions of them for that matter and Yeah, but I think actually like at the moment the integration comes first, I mean it's sort of at the moment the building the browser comes first, and then only comes the creating new sophisticated data chunks, because that's sort of the whole thing about having a prototype system which is more or less working on on chunk data. But it at least we have the framework in which we can then test everything and and look at everything. 'Cause before we have that, it's gonna be very difficult for anyone to really see how much the work that they're doing is making sense because you just well I guess you can see something from the data that you have in your individual X_M_L_ s files files that you create, but it would be nice to have some basic system which just displays some stuff. 
Or just adapt like their like just sort of go from their system and and adapt that piece for piece and see how we could how we could arran like adapt it to our system. Does anyone want to like just sit with me and like play for three hours with NITE X_M_L_ at some point? Uh I wouldn't like to be 'cause I'd like to go to the gym. I'm theoretically free. But if there's any time t hmm? You have nothing no free time on Wednesday. Hmm. Nine 'til twelve and then nothi you have or you Hmm? Anytime Wednesday afternoon I'd be cool, I think. Yo, Forrest Hill, whatever one's easier to discuss stuff, I don't know. I'm not biased. Okay. What time do you wanna do? Okay, so I'll just meet you in in eighteen a in the afternoon. I guess at the moment nobody critically depends on like the NITE X_M_L_ stuff working right now, right? Like at the moment you can all do your stuff and I can do my L_S_A_ stuff. And I can even do the display to a vast degree without actually having their supplying framework working. So it's not that crucial. Yeah, actually I need the raw text as well. Yeah, but I was I was I was more thinking of the sort of the the whole browser framework as a running programme now. Yeah, I think we all need the raw text in different in different flavours, don't we? But number within the X_M_L_ context. Are they spoken numbers? Like do they look like they're utterances numbers? There's the number task, isn't there. That's part of the whole thing. Hmm? Okay. Hmm. Yeah, we have to probably cut that out anyway for our project, I don't know. It's probably gonna screw up a lot of our data otherwise. If Not sure if it what it does to document It would probably make the yeah, if if you have segments for that, probably the Okay. Uh I'm just thinking like it pro it pro probably like the L_S_A_ would perform quite well on it. It would probably find another number task quite easily seeing that it's a constrained vocabulary with a high co-occurrence of the same nine words. 
So that wou ten word. Hmm? Yeah. I think it's also something that they they said the numbers in order, right? Yeah, I think it's it the it sounded like they wanted to check out how well they were doing with overlapping and stuff, because basically it's like they're reading them at different speeds, but you know in which order they are said. Anyway. ICSI has some reasons for doing it. They must have been pissed off saying like numbers at the end of every meeting. Um Dave, if you would or actually for well, if you're doing I_D_F_s or you whatever you call your your frequencies, I always mix up the name, uh you need some dictionary for that at some point though, like you need to have some representation of a word as not not that specific occurrence of that word token, but of of of a given word form. Because you're making counts for word forms, right? Yeah, so we should work together on that, because I need a dictionary as well. Okay. 'Kay. Okay. Didn't you say that the o the ord Yeah, but for I'm just wondering for the whole thing. Does somebody wo who was it of you two who said that um there's some programme which spits out a dictionary probably with frequencies? Okay. Is anyone of you for the for the document frequency over total frequency, you gonna have total frequencies of words then with that, right? Like over the whole corpus sort of. Or W using which tool are you talking about? Be careful with that. Like my experience with the British National Corpus was that there's far more word types than you ever think because anything that's sort of unusual generally is a new word type. Like any typo or any strange thing where they put two words together. And also any number as a word type of its own. So you can easily end up with hundred thousands of words when you didn't expect them. So generally dictionaries can grow bigger than you think they do. 
Well you can probably also you can probably pre-filter like with regular expressions even just say if it consists of only dig digits, then skip it, or even if it consists any special characters, then skip it because it's probably something with a dot in between, which is usually not something you wanna have and What I did, for my project I just ignored the hundred most frequent words, because they actually end up all being articles and and everything and stuff. So we need like several of us need a dictionary. Am I the only one who needs it with frequencies? Or Frequencies. Yeah. Well I guess as soon as we have the raw text, we can probably just start with the Java hash map and like just hash map over it and see how far we get. I mean we can probably on a machine with a few hundred megabyte RAM you can go quite far. You can write it on beefy. So even if it goes wrong and even if it has a million words be Oh yeah, burning it on a like we should be able to burn the whole corpus, just the X_ hmm? Ah I see, I asked support about that two days ago. In the Informatics building there oh sorry, in in Appleton Tower five the ones closest t two machines closest to the support office. So I presume oh wait, I have the exact email. I think he's talking about sort of the ones that Yeah, if you if you enter the big room, in the right-hand corner, I think. Um the thing is like you can only burn from the local file-system. So if it's from s well actually I think if it's mounted, you can directly burn from there, but the problem is I have my data on beefy and so I have to get it into the local temp directory and burn it from there. But you can burn it from there. Uh we looked that up and I for we looked that up and I forgot. Yeah yeah. No, you you we should be able to get it at I don't think it was I don't think it was a gigabyte. Hmm. See I would off I would offer you to to get it on this one, and then um like copy it. 
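The regular-expression pre-filter mentioned in this turn (skip tokens made only of digits, skip tokens containing special characters such as a dot) could look roughly like the sketch below. The class name `TokenFilter`, the method `keep`, and the choice to still allow apostrophes and hyphens inside words are assumptions made for illustration; nothing here is taken from the project's actual code.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-filter for dictionary tokens; all names are illustrative. */
final class TokenFilter {
    // Rule 1: tokens that consist only of digits, e.g. "1995".
    private static final Pattern DIGITS_ONLY = Pattern.compile("\\d+");

    // Rule 2: tokens containing any character that is not a letter,
    // an apostrophe or a hyphen (assumption: "don't" and "co-occurrence" stay).
    private static final Pattern HAS_SPECIAL = Pattern.compile(".*[^\\p{L}'-].*");

    /** Returns true if the token should be counted in the dictionary. */
    static boolean keep(String token) {
        if (token.isEmpty()) {
            return false;
        }
        if (DIGITS_ONLY.matcher(token).matches()) {
            return false;
        }
        return !HAS_SPECIAL.matcher(token).matches();
    }
}
```

Under these rules `TokenFilter.keep("meeting")` is accepted, while `"1995"` and `"e.g"` are skipped. The digits-only rule is strictly redundant given the second pattern, but it is kept separate because the two rules were described separately.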
But you know what I figured out, I'm quicker down-loading over broad-band into my computer than using this hard disk. There's something strange about the way how they access the hard disk, how they mount it, which is unfortunate. Hmm. What operating system do you have? Okay. Wh what connection do you have at home? Yeah. So if anyone of us gets it, we can then just use an ext hmm? Yeah, burn it to C_D_ or, yeah, put it on on hard disk, whatever. Question is if you're not quicker if you uh because you should get massive compression out of that. Like fifty percent or something with a good algorithm. So if you could compress it and just put it into a temp directory. Like The temp the temps usually have for gigabyte three or two. The temps, yeah. I do like I mean there's not guarantee that anything stays there, but overnight it'll stay. And I think the temps usually have. Ah yeah, but that would have to be the temp directory off the machine you can S_S_H_ into directory of S_S_H_. Yeah, they wou they'd they'd probably hate you for doing it. But They'd probably they'd like you more if you S_S_H_ uh into another computer, compress it there and then sort of copy it into the into the gateway machine. They have um if you S_S_ hey, you know, if you if you S_S_H_ and they have this big warning about doing nothing at all in the gateway machine. Yeah. To your home machine. I haven't I haven't figured out how to tunnel through the gateway into another machine yet. It's not it's not easy definitely. That's why I end up sort of copying stuff into the temp directory at the gateway machine. Sorry if this is boring everybody else. This is just details and how to get stuff home from what we can probably just look at that together when we're meeting. I'm sorry. Mm-hmm. Well yeah. 
As soon as somebody gives me the raw text of the whole thing, I can probably just implement like a five line Java hash table frequency dictionary builder and see Oh, did you not say frequencies f of words in the whole sorry, did uh So you'd you Yeah, you'd have to count it yourself, yeah. Oh, you don't wanna have different counts for each chunk, but just like sort of for for something from old chunks. Oh yeah, no, that's yeah, so once I write an ar like w if I write like an algorithm which does a hash um table dictionary with frequency from a raw text, then the raw text can be anything. So how far are we g uh how f how far are you getting raw text out of it do you think? Okay, well that's good, because for the dictionary the order doesn't make a difference, does it? So yeah, so um I'll get that from you and I'll write the hash table which goes over that and creates a dictionary file. So for the dictionary, is it okay if I do, whatever, word blank frequency or something? Just p could everybody sort of start from that? I mean I guess we can Yeah, I I need frequency as well. Well I think we might have a lot in common what we calculate because I for my latent semantic analysis need like counts of words within a document, uh within a a segment actually, within a topic segment. Can I convert these probabilities back into frequencies? Okay. Oh, so that's what f Rainbow does, because that's what L_S_A_ builds on. Like it builds a f a document by frequency matrix. So I could probably get that. Even though but I already have I already have my code to build it up myself. No, don't bother. I have my code already. Um Yeah, so Dave, you said you need the frequency counts actually for per document, would you say, not for the whole thing? 
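The hash-table frequency dictionary described in this turn, written out as "word blank frequency" lines, might be sketched as follows. This is a minimal sketch under stated assumptions: tokens are taken to be whitespace-separated, no case folding or filtering is applied, and the names `FrequencyDictionary`, `count` and `toDictionaryFile` are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical frequency-dictionary builder; all names are illustrative. */
final class FrequencyDictionary {

    /** Counts how often each whitespace-separated token occurs in the raw text. */
    static Map<String, Integer> count(String rawText) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : rawText.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only when the input is blank
            }
            // Insert 1 for a new word, otherwise add 1 to the stored count.
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Renders the dictionary as one "word frequency" pair per line. */
    static String toDictionaryFile(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        // HashMap iteration order is unspecified, which is fine here
        // because the order of dictionary entries does not matter.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            sb.append(entry.getKey()).append(' ').append(entry.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```

Because `count` takes any string, the same code works whether the raw text is one meeting, one topic segment, or the whole series, which matches the point that "the raw text can be anything".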
It more and more appears to me that if we if we scrap the notion of the meeting as an individual thing and sort of ju see meetings as as topic segments and have sort of like hierarchical topic segmentation instead, then it's b like a more coherent framework. Wait, are we are we using this um for the for the for the do for the weighting in the end now, this this measure you're calculating? Because if we're doing Like I think for for the information density we uh we should calculate it on the lowest level, not on the highest. But like 'cause Yeah, but w it don't you have to like go sort of like for in a document versus the whole thing? Isn't that how it works that you c look look at r I don't think that's a good idea because isn't it like that we expect th there to change over i b with the different topic segments more? That they talk about something different in each different topic segment. 'Cause that's what relative term frequency is about, that like in some context they're talking more about a certain word than in general. So that would more be the the topic segments then. I don't know. Yeah. Yeah. Yeah. So I'm just wondering if there's ways to abandon the whole concept of of meetings and sort of but just not really treating separate meetings as too much of a separate entity. But But on algorithmic level, whether we actually whether there's some way to just represent meetings as as topics. Hmm. That's not really what I meant. But I think I have to think more about what I meant. Um g I'm confused about everything. Yeah. 
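The idea of relative term frequency raised here (a word being talked about more in one topic segment than in general) can be illustrated with a small calculation. The exact measure was not specified in the meeting, so the formula below is an assumption: the word's share of the segment divided by its share of the whole corpus. The class and method names are likewise hypothetical.

```java
import java.util.Map;

/** Hypothetical relative-frequency measure; the formula is an assumption. */
final class RelativeFrequency {

    /** Total number of tokens represented by a count table. */
    static long total(Map<String, Integer> counts) {
        long sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    /**
     * (count in segment / segment size) divided by (count in corpus / corpus size).
     * Values above 1 mean the word is more frequent in the segment than in general.
     * Returns 0 if the word is absent from the corpus or the segment is empty.
     */
    static double ratio(String word, Map<String, Integer> segment, Map<String, Integer> corpus) {
        int inSegment = segment.getOrDefault(word, 0);
        int inCorpus = corpus.getOrDefault(word, 0);
        long segmentTotal = total(segment);
        long corpusTotal = total(corpus);
        if (inCorpus == 0 || segmentTotal == 0) {
            return 0.0;
        }
        double segmentRate = (double) inSegment / segmentTotal;
        double corpusRate = (double) inCorpus / corpusTotal;
        return segmentRate / corpusRate;
    }
}
```

Computing the ratio per topic segment rather than per meeting corresponds to calculating it "on the lowest level"; swapping in a meeting-level count table for `segment` gives the coarser variant.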
I'm I'm not so concerned about the m a meeting plus something else, I'm more talking about like, yeah, the keeping keeping the same algorithm and the same way of handling it and just saying like just this this topic here i uh it happens to be like a whole meeting and it has sort of sub-topics, so just that sort of topics a hierarchical concept where like a topic where there can be super-topics and topics, and the super-topics are in the end what the meetings are, but in general at some level super-topics are treated like like topics. Hmm. Mm I'm not really sure what I want. So sorry, could describe that again, the Mm-hmm. Mm-hmm. Mm-hmm. So that would be the series as a whole. That would be sort of m meetings, yeah. Yeah. I'm a I'm a I'm a bit brain-damaged at the moment, but I think I'll just sit together with you again and and go through it again. Hmm. So so I'll is th it like is this and this structurally then always identical? So that we can that we can treat it with the same algorithm or Yeah, I'm also not sure how we can go from from bottom-up. I have always thought it's like more that oh, whatever, I'm a can't think of it at the moment. Probably this is all too complicated worrying about that at that moment anyway. Now have have we have we decided anything, are we doing anything? S Wednesday we are meeting and looking at their at their implementation in some more detail to actually understand what's going on. We had two things from their stuff just to make sure that we are like understand it, we understand it enough to to m modify it. Yep. How would we do that? By just making like it w read write for everyone. 'Kay, who has most free space on their Same here. Well we alternatively we can probably just make another directory on the beefy scratch space. I mean that's where I'm having gigabytes and gigabytes of stuff at the moment. No. No. Yeah. But I think if he sends to the I think if he sends to the port he'd probably be in a better position. Yeah. Hmm. 
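The hierarchical view floated in this turn, where a meeting is just a super-topic and super-topics are handled like topics, could be represented by a single recursive node type. The sketch below is only one possible reading of that idea; `TopicNode` and its members are hypothetical names, and attaching words directly to nodes is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical node in a topic hierarchy: series, meeting and topic share one type. */
final class TopicNode {
    final String label;
    final List<TopicNode> children = new ArrayList<>();
    // Words attached directly to this node (typically only at the leaves).
    private final Map<String, Integer> ownCounts = new HashMap<>();

    TopicNode(String label) {
        this.label = label;
    }

    /** Adds a sub-topic and returns this node so calls can be chained. */
    TopicNode add(TopicNode child) {
        children.add(child);
        return this;
    }

    void addWord(String word) {
        ownCounts.merge(word, 1, Integer::sum);
    }

    /** Word counts for this node plus everything below it, computed recursively. */
    Map<String, Integer> counts() {
        Map<String, Integer> totals = new HashMap<>(ownCounts);
        for (TopicNode child : children) {
            for (Map.Entry<String, Integer> e : child.counts().entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }
}
```

Since `counts()` is the same call at every level, any algorithm that accepts a count table can be run unchanged on a topic, a meeting, or a whole series.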
I think he said yes to that. I think uh that was like in when we were still in the seminar room, I asked that once or like ask is it possible to get it off and nobody said like people were discussing about the technical practicalities, but nobody said anything about al being allowed to or not allowed to. I mean, we have access to it here and I guess it probably means that we we can't give it to anybody else. But but if they give us access to it here o sitting on a DICE machine, then there shouldn't be a reason why we shouldn't be able to use it on our laptop. I personally don't have too many friends who would be too keen on getting it anyway. I have that really excited pirate copied thing. It annotated meeting data. Huh. Wait, wait, wait. Um sorry. Yeah, sorry. What I just realised, we should really t keep different serieses completely separate for virtually all purposes. Just let's be careful about that, because like the the ICSI corpus isn't isn't one meeting series, it's several meeting series with different people meeting for completely different things. For each meeting. Alright. Okay, but like let's just be careful that whatever we sort of we merge together, that like the highest level of merging, it's not the whole ICSI corpus but individual series. I think we might actually I think That's probably be somewhere like well or something like it. Um I think we might just get away with for the whole project just like looking at only one series and just doing within one series. I mean you can do everything you want in one series. Oh yeah, let's take that. Is the is the data always clearly split up by different series? Uh like is it easy to just pick one Okay. Okay. Okay. Okay. So at at every level everyone has to be careful to really just take even at the highest level, just take stuff from one series and not merge stuff from different series together because they would probably be just majorly messy. 
Yeah, so so t so like if even if we make one single text file which has the whole corpus, sort of our corpus, that would still be from one series only. Wou but it what you're producing at the moment is like individual text files that sort of have the raw text for a whole a meeting as a whole or Mm-hmm. Yeah. 'Kay. Um so is is anybody creating an uh a real raw text thing at the moment, like which is just the words? Yeah, tha 'cause that's what I'm gonna need as well. But i but if there uh b aren't like so it's it's start and end times just for the file. Like is it just the first and the last line? Or is it for every single thing in So what do you mean by just not print out that? Okay. If you're into it, can you make a text file which just like makes just the words? 'Kay. Do you want it straight flowing, 'cause I would need something that marks the end of uh of uh is is yours segmented by topics then that like is there any information that you have to the topic, to the automated topic topic segmentation? Oh then I need something different later anyway. Okay, but for now, if you c Okay. You're gonna put that as an output of yours, the segmentation. Okay, so for now can you create like sort of just uh a dump which is pure text, just pure text so that I can get a dictionary and you can work on that for your topic segmentation. And Or for for the series. But I can but I can also deal with separate files, I mean I can just write the algorithm that it loads all files in a directory or something. But I mean if you But if you can put it in one single mega-file, that would be quite useful for me. Even though for you, wouldn't it be easier if you had different files because then you sort of know like Yeah. So give m give me different files as long as like it m if you could name them in a way that is easy to enumerate over them, like whatever, one two three four five or something. Or just anything that I can Yeah. Is is it something that's easily enu like to enumerate over? 
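Loading "all files in a directory" in a predictable order depends on the naming scheme, which was left open at this point. The sketch below assumes, purely for illustration, that the per-meeting text files are called `1.txt`, `2.txt`, and so on; it sorts by the leading number so that `10.txt` comes after `2.txt`. `MeetingFiles` and its methods are hypothetical names.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for enumerating per-meeting text files in order. */
final class MeetingFiles {

    /** Leading integer of a name such as "12.txt"; names without one sort last. */
    static int index(String name) {
        int end = 0;
        while (end < name.length() && Character.isDigit(name.charAt(end))) {
            end++;
        }
        if (end == 0) {
            return Integer.MAX_VALUE;
        }
        try {
            return Integer.parseInt(name.substring(0, end));
        } catch (NumberFormatException tooLarge) {
            return Integer.MAX_VALUE; // digit run does not fit in an int
        }
    }

    /** Returns a copy of the names sorted by leading number, then alphabetically. */
    static List<String> inOrder(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort((a, b) -> {
            int byIndex = Integer.compare(index(a), index(b));
            return byIndex != 0 ? byIndex : a.compareTo(b);
        });
        return sorted;
    }

    /** Lists the ".txt" files in a directory in that order (empty if unreadable). */
    static List<String> list(File directory) {
        String[] names = directory.list((dir, name) -> name.endsWith(".txt"));
        List<String> result = new ArrayList<>();
        if (names != null) {
            for (String name : names) {
                result.add(name);
            }
        }
        return inOrder(result);
    }
}
```

A plain alphabetical sort would put `10.txt` before `2.txt`, which is why the numeric comparison is used; zero-padded names such as `01.txt` would make that step unnecessary.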
Is it some just some ordered pattern? Okay, cool. Okay, cool. Yeah. In the right order. It's just a wish list. Orders. When do you think you'll have um like a primitive segmentation by some ready-made topic segmentation by some ready-made tool ready? Okay. Okay, cool. 'Cause I'll need that then when it's done. Okay. Mm-hmm. What's what's nine megabyte? The the That sounds quite reasonable. That's nine nine characters over okay. Okay. Okay. That is for are we are we picking one particular series at the moment? Or Yes. Okay. Yeah. Yeah, I guess we can probably process the data for all different series and then check which series is the best for the presentation. It sounds quite reasonable, nine megabyte. I mean if you think if it's r roughly a million words and nine characters per word sounds realisti Yeah. Yes, I'm gonna build a dictionary then from that. Like just a list of the words that maybe a list of the words with the frequencies or a list of the words sorted alphabetically or numerically. What what does anyone want? Does this there any wishes for dictionaries? So I'll create a dictionary. Add add the structure, yeah. And then the actual file we can probably like copy from your home directory or something like it. Yeah yeah, but I'm sa I'm saying for the whole thing in the end. Then like the big thing we probably shouldn't do by email. Yeah. Oh, from the time I get the file I can do that in an afternoon, the next sort of the next morning. Oh, you mean how long processing time it takes. Ah, it's a it's a bog standard algorithm. I've I've sort of I've written it for for DIL just in half an hour or something similar. It's just you put them in a hash table and and say well if it exists already in the hash table then you increase the count by one and I'll probably implement some filter for filtering out numbers or something. Really? How do you do that? Okay, well I don't know any Perl. 
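The two dictionary orderings asked about here, alphabetical and by frequency, are both cheap to produce from the same count table. The following sketch assumes the counts already sit in a `Map<String, Integer>`; the class name `DictionaryOrder` and the tie-breaking rule (alphabetical among equally frequent words) are illustrative choices, not decisions made in the meeting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical orderings for a word-frequency dictionary. */
final class DictionaryOrder {

    /** Words in alphabetical (natural String) order. */
    static List<String> alphabetical(Map<String, Integer> counts) {
        // A TreeMap keeps its keys sorted, so copying into one sorts the words.
        return new ArrayList<>(new TreeMap<>(counts).keySet());
    }

    /** Words from most to least frequent; ties are broken alphabetically. */
    static List<String> byFrequency(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> words = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            words.add(entry.getKey());
        }
        return words;
    }
}
```

The head of the `byFrequency` list is also where a cut-off such as "ignore the hundred most frequent words" would be taken from.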
I mean if anyone wants to do a Perl script for that that does it does it nicely, I uh I've no problem with that. I but I think I have the Java code virtually ready because for DIL I wrote something very similar. Like for DIL I wrote something that counts the the different occurrences of all the tags um Sorry? The hash table? Uh I've never serialized anything. Wouldn't that be absolutely massive though? And then seriali and then write the serialization to a file. So you want like a se like a file which is the serialization of a hash table. Okay. Yeah. I I'll I'll check if I understand how it works. I mean otherwise I can give you the code for loading a dictionary. Give you my my it's just it's it's sort of it's a line break separated file, you know. Yeah. Yeah, I'll see if I understand how to serialize. There's a there's a serialise command so that gives me one mega mother of a s Yeah, but do they automatically write to the file anyway I'll I'll figure that out. We don't have to Yes, is that pretty much pretty much it? So Dave and me look at how NITE X_M_L_ works and we're Hmm. I'll build a dictionary as soon as I get the text. And yeah, so that When do we have to meet again then with this? How are we gonna do a demonstrator next week? My God. No no, not demonstrate, but like didn't you say that uh didn't we sort of agree that it would be useful to have a demonstrator of it, like some primitive thing working next week. That's gotta be very prototype. Mm-hmm. Ah well, let's go. Sorry. I feel like like hanging mid-air and not really like finding a point where you can get your teeth into it and start working properly and so it's all so fuzzy the whole Yeah, but it at the moment but at the moment it's also an implementational level. Like with the data structures, I'm just like over these vague ideas of some trees, I'm f Yeah. It's just we are half-way through the project time table. That's just what freaks me out. Um
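For the serialization question raised in this turn: Java's built-in object serialization can write a `HashMap` out as bytes and read it back, because `HashMap`, `String` and `Integer` all implement `java.io.Serializable`. The sketch below round-trips through a byte array so it can be tried without touching the file system; wrapping a `FileOutputStream` or `FileInputStream` instead would write to or read from a file. `DictionarySerializer` is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;

/** Hypothetical helper that serializes a frequency dictionary to bytes and back. */
final class DictionarySerializer {

    /** Serializes the map with standard Java object serialization. */
    static byte[] toBytes(HashMap<String, Integer> counts) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(counts);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads back a map previously produced by toBytes. */
    @SuppressWarnings("unchecked")
    static HashMap<String, Integer> fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (HashMap<String, Integer>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Serialized class not available", e);
        }
    }
}
```

The serialized form is a binary format readable only from Java, so the line-break separated "word frequency" text file remains the more portable option if anyone ends up working in Perl or Python.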
